VITON: An Image-based Virtual Try-on Network
We present an image-based Virtual Try-On Network (VITON) without using 3D
information in any form, which seamlessly transfers a desired clothing item
onto the corresponding region of a person using a coarse-to-fine strategy.
Conditioned upon a new clothing-agnostic yet descriptive person representation,
our framework first generates a coarse synthesized image with the target
clothing item overlaid on that same person in the same pose. We further enhance
the initial blurry clothing area with a refinement network. The network is
trained to learn how much detail to utilize from the target clothing item, and
where to apply it to the person in order to synthesize a photo-realistic image in
which the target item deforms naturally with clear visual patterns. Experiments
on our newly collected Zalando dataset demonstrate its promise in the
image-based virtual try-on task over state-of-the-art generative models.
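The coarse-to-fine strategy above can be sketched in miniature. This is not the authors' implementation: the function names and the 1-D toy "image" below are hypothetical, and the sketch only illustrates how a predicted per-pixel composition mask decides where detail from the warped clothing item overrides the coarse first-stage synthesis.

```python
# Illustrative coarse-to-fine composition: a per-pixel mask alpha controls
# how much detail is copied from the (warped) target clothing versus kept
# from the blurry coarse synthesis. Names and toy data are hypothetical.

def composite(coarse, clothing, alpha):
    """Blend coarse synthesis with clothing detail via a composition mask.

    alpha close to 1 copies clothing texture; alpha close to 0 keeps the
    coarse result (e.g. body regions the clothing does not cover).
    """
    assert len(coarse) == len(clothing) == len(alpha)
    return [a * c + (1.0 - a) * s for s, c, a in zip(coarse, clothing, alpha)]

# Toy 4-"pixel" strip where the middle pixels belong to the clothing region.
coarse_img   = [0.2, 0.5, 0.5, 0.3]   # blurry first-stage output
clothing_img = [0.0, 0.9, 0.8, 0.0]   # warped target clothing item
mask         = [0.0, 1.0, 1.0, 0.0]   # predicted composition mask

print(composite(coarse_img, clothing_img, mask))  # [0.2, 0.9, 0.8, 0.3]
```

In the paper the mask and the refined image are predicted by the refinement network; here the mask is hand-written to make the blending rule itself visible.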
Image and Video Understanding with Constrained Resources
Recent advances in computer vision tasks have been driven by high-capacity deep neural networks, particularly Convolutional Neural Networks (CNNs) with hundreds of layers trained in a supervised manner. However, this poses two significant challenges: (1) the increased depth of CNNs, while leading to significant improvements on competitive benchmarks, limits their deployment in real-world scenarios due to high computational cost; (2) the need to collect millions of human-labeled samples for training prevents such approaches from scaling, especially for fine-grained image understanding like semantic segmentation, where dense annotations are extremely expensive to obtain. To mitigate these issues, we focus on image and video understanding with constrained resources, in the form of computational resources and annotation resources. In particular, we present approaches that (1) investigate dynamic computation frameworks which adaptively allocate computing resources on-the-fly for a novel image/video to manage the trade-off between accuracy and computational complexity; and (2) derive robust representations with minimal human supervision by exploring context relationships or using shared information across domains.
With this in mind, we first introduce BlockDrop, a conditional computation approach that learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation without degrading prediction accuracy. Next, we generalize the idea of conditional computation from images to videos by presenting AdaFrame, a framework that adaptively selects relevant frames on a per-input basis for fast video recognition. AdaFrame assumes access to all frames in a video, and hence can only be used in offline settings. To mitigate this issue, we introduce LiteEval, a simple yet effective coarse-to-fine framework for resource-efficient video recognition, suitable for both online and offline scenarios.
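The conditional computation idea behind BlockDrop can be illustrated with a minimal sketch, assuming a residual network where skipped blocks reduce to the identity. The policy, block functions, and names below are hypothetical stand-ins, not the actual learned policy network:

```python
# Minimal sketch of conditional computation: a per-input binary policy
# decides which residual blocks to run; skipped blocks cost nothing because
# the residual connection lets the input pass through unchanged.

def run_with_policy(x, blocks, policy):
    """Execute only the blocks the policy selects; count the compute used."""
    executed = 0
    for block, keep in zip(blocks, policy):
        if keep:                      # policy said "run this block"
            x = x + block(x)          # residual block: identity + transform
            executed += 1
        # else: skip the block entirely; x is unchanged
    return x, executed

# Toy residual "blocks": each just scales its input for the residual branch.
blocks = [lambda v: v * 0.1, lambda v: v * 0.2, lambda v: v * 0.3]
out, used = run_with_policy(1.0, blocks, policy=[1, 0, 1])
print(out, used)  # only 2 of 3 blocks executed
```

In BlockDrop the policy is itself produced by a small network conditioned on the input and trained to preserve accuracy while minimizing the number of executed blocks; here it is a fixed list to keep the control flow visible.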
To derive robust feature representations with limited annotation resources, we first explore the power of spatial context as a supervisory signal for learning visual representations. We also propose to learn from synthetic data rendered by modern computer graphics tools, where ground-truth labels are readily available. We propose Dual Channel-wise Alignment Networks (DCAN), a simple yet effective approach that reduces domain shift at both the pixel level and the feature level, for unsupervised scene adaptation.
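Channel-wise alignment in the spirit of DCAN can be sketched as re-normalizing each source channel to match the target channel's statistics, which reduces the domain shift between, say, synthetic and real activations. The function and the toy activation values are hypothetical; DCAN applies this jointly at pixel and feature levels inside the network:

```python
# Sketch of channel-wise statistic alignment: shift and scale one channel of
# source-domain activations so its mean/std match the target channel's.
import statistics

def align_channel(source, target):
    """Re-normalize source values to the target channel's mean and std."""
    mu_s, sd_s = statistics.mean(source), statistics.pstdev(source)
    mu_t, sd_t = statistics.mean(target), statistics.pstdev(target)
    return [(v - mu_s) / sd_s * sd_t + mu_t for v in source]

src = [1.0, 2.0, 3.0, 4.0]       # e.g. synthetic-domain channel activations
tgt = [10.0, 12.0, 14.0, 16.0]   # e.g. real-domain channel activations

aligned = align_channel(src, tgt)
print(statistics.mean(aligned), statistics.pstdev(aligned))  # matches target stats
```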
Learning Fashion Compatibility with Bidirectional LSTMs
The ubiquity of online fashion shopping demands effective recommendation
services for customers. In this paper, we study two types of fashion
recommendation: (i) suggesting an item that matches existing components in a
set to form a stylish outfit (a collection of fashion items), and (ii)
generating an outfit with multimodal (images/text) specifications from a user.
To this end, we propose to jointly learn a visual-semantic embedding and the
compatibility relationships among fashion items in an end-to-end fashion. More
specifically, we consider a fashion outfit to be a sequence (usually from top
to bottom and then accessories) and each item in the outfit as a time step.
Given the fashion items in an outfit, we train a bidirectional LSTM (Bi-LSTM)
model to sequentially predict the next item conditioned on previous ones to
learn their compatibility relationships. Further, we learn a visual-semantic
space by regressing image features to their semantic representations aiming to
inject attribute and category information as a regularization for training the
LSTM. The trained network can not only perform the aforementioned
recommendations effectively but also predict the compatibility of a given
outfit. We conduct extensive experiments on our newly collected Polyvore
dataset, and the results provide strong qualitative and quantitative evidence
that our framework outperforms alternative methods.
Comment: ACM MM 1
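The sequential scoring idea, that each item in an outfit should be predictable from its neighbors in both directions, can be illustrated without an actual LSTM. The sketch below is a deliberately tiny stand-in: a hand-written bigram compatibility table and hypothetical item names replace the learned Bi-LSTM, but the bidirectional log-likelihood scoring follows the same pattern:

```python
# Toy bidirectional compatibility scoring: an outfit is scored by summing
# forward and backward transition log-likelihoods over the item sequence.
# The table below is hand-written; in the paper these conditionals come
# from a trained Bi-LSTM over visual-semantic item embeddings.
import math

COMPAT = {                      # toy P(next item | current item)
    ("top", "bottom"): 0.8, ("bottom", "shoes"): 0.7,
    ("bottom", "top"): 0.6, ("shoes", "bottom"): 0.5,
    ("top", "shoes"): 0.1, ("shoes", "top"): 0.1,
}

def outfit_score(items):
    """Sum of forward and backward transition log-likelihoods."""
    fwd = sum(math.log(COMPAT[(a, b)]) for a, b in zip(items, items[1:]))
    bwd = sum(math.log(COMPAT[(b, a)]) for a, b in zip(items, items[1:]))
    return fwd + bwd

good = outfit_score(["top", "bottom", "shoes"])   # conventional top-to-bottom order
bad  = outfit_score(["top", "shoes"])             # low-compatibility transition
print(good > bad)  # True
```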
Multi-Prompt Alignment for Multi-source Unsupervised Domain Adaptation
Most existing methods for multi-source unsupervised domain adaptation (UDA)
rely on a common feature encoder to extract domain-invariant features. However,
learning such an encoder involves updating the parameters of the entire
network, which makes the optimization computationally expensive, particularly
when coupled with min-max objectives. Inspired by recent advances in prompt
learning that adapts high-capacity deep models for downstream tasks in a
computationally economic way, we introduce Multi-Prompt Alignment (MPA), a
simple yet efficient two-stage framework for multi-source UDA. Given a source
and target domain pair, MPA first trains an individual prompt to minimize the
domain gap through a contrastive loss, while tuning only a small set of
parameters. Then, MPA derives a low-dimensional latent space through an
auto-encoding process that maximizes the agreement of multiple learned prompts.
The resulting embedding further facilitates generalization to unseen domains.
Extensive experiments show that our method achieves state-of-the-art results on
popular benchmark datasets while requiring substantially fewer tunable
parameters. To the best of our knowledge, we are the first to apply prompt
learning to the multi-source UDA problem and our method achieves the highest
reported average accuracy of 54.1% on DomainNet, the most challenging UDA
dataset to date, with only 15.9M parameters trained. More importantly, we
demonstrate that the learned embedding space can be easily adapted to novel
unseen domains.
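The first stage of MPA, tuning only a small prompt while the encoder stays frozen, can be sketched with a toy gradient descent. Everything here is a hypothetical stand-in: the "encoder" is a trivial additive function and the objective is a plain squared distance between prompted source features and target features rather than the paper's contrastive loss, but it shows how few parameters actually get updated:

```python
# Sketch of prompt tuning with a frozen encoder: only the prompt vector is
# trainable, updated by gradient descent on the squared feature distance
# between the (prompted) source and the target domain.

def encode(x, prompt):
    """Frozen toy 'encoder': feature is the input plus the learnable prompt."""
    return [xi + p for xi, p in zip(x, prompt)]

def tune_prompt(source, target, steps=200, lr=0.1):
    prompt = [0.0] * len(source)            # the only trainable parameters
    for _ in range(steps):
        feat = encode(source, prompt)
        grad = [2 * (f - t) for f, t in zip(feat, target)]  # d/dprompt ||f - t||^2
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt

src_feat = [1.0, 2.0]
tgt_feat = [3.0, 1.0]
prompt = tune_prompt(src_feat, tgt_feat)
print([round(p, 3) for p in prompt])  # [2.0, -1.0]: the prompt closes the gap
```

MPA's second stage would then embed several such per-pair prompts into a shared low-dimensional space; that auto-encoding step is omitted here.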
Recognizing Instagram Filtered Images with Feature De-stylization
Deep neural networks have been shown to suffer from poor generalization when
small perturbations are added (like Gaussian noise), yet little work has been
done to evaluate their robustness to more natural image transformations like
photo filters. This paper presents a study on how popular pretrained models are
affected by commonly used Instagram filters. To this end, we introduce
ImageNet-Instagram, a filtered version of ImageNet, where 20 popular Instagram
filters are applied to each image in ImageNet. Our analysis suggests that
simple, structure-preserving filters that only alter the global appearance of
an image can lead to large differences in the convolutional feature space. To
improve generalization, we introduce a lightweight de-stylization module that
predicts parameters used for scaling and shifting feature maps to "undo" the
changes incurred by filters, inverting the process of style transfer tasks. We
further demonstrate the module can be readily plugged into modern CNN
architectures together with skip connections. We conduct extensive studies on
ImageNet-Instagram, and show quantitatively and qualitatively, that the
proposed module, among other things, can effectively improve generalization by
simply learning normalization parameters without retraining the entire network,
thus recovering the alterations in the feature space caused by the filters.
Comment: Accepted in AAAI 2020 as an oral presentation paper
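The scale-and-shift mechanism of the de-stylization module can be sketched as an affine transform per channel. The filter, the names, and the hand-computed parameters below are hypothetical; in the paper, gamma and beta are predicted by the module from the input rather than derived in closed form:

```python
# Sketch of channel-wise de-stylization: if a filter acts (approximately)
# as an affine map a*v + b on a feature channel, a predicted scale gamma
# and shift beta can undo it, here gamma = 1/a and beta = -b/a.

def scale_shift(channel, gamma, beta):
    """Channel-wise affine transform used to 'undo' a filter's effect."""
    return [gamma * v + beta for v in channel]

def toy_filter(channel):
    """A toy 'Instagram filter' that darkens and lifts a channel."""
    return [0.5 * v + 0.2 for v in channel]

original = [0.1, 0.4, 0.9]
filtered = toy_filter(original)

restored = scale_shift(filtered, gamma=2.0, beta=-0.4)
print([round(v, 6) for v in restored])  # [0.1, 0.4, 0.9], the original channel
```

Real filters are of course not globally affine, which is why the module predicts the normalization parameters from the filtered image itself instead of using fixed values.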